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Abstract 

Background: Horseradish peroxidases (HRPs) from Armoracia rusticana have long been utilized as reporters in 
various diagnostic assays and histochemical stainings. Regardless of their increasing importance in the field of life 
sciences and suggested uses in medical applications, chemical synthesis and other industrial applications, the HRP 
isoenzymes, their substrate specificities and enzymatic properties are poorly characterized. Due to lacking sequence 
information of natural isoenzymes and the low levels of HRP expression in heterologous hosts, commercially 
available HRP is still extracted as a mixture of isoenzymes from the roots of A rusticana. 

Results: In this study, a normalized, size-selected A rusticana transcriptome library was sequenced using 454 Titanium 
technology. The resulting reads were assembled into 14871 isotigs with an average length of 1 133 bp. Sequence 
databases, ORF finding and ORF characterization were utilized to identify peroxidase genes from the 14871 isotigs 
generated by de novo assembly. The sequences were manually reviewed and verified with Sanger sequencing of PCR 
amplified genomic fragments, resulting in the discovery of 28 secretory peroxidases, 23 of them previously unknown. 
A total of 22 isoenzymes including allelic variants were successfully expressed in Pichia pastoris and showed peroxidase 
activity with at least one of the substrates tested, thus enabling their development into commercial pure isoenzymes. 

Conclusions: This study demonstrates that transcriptome sequencing combined with sequence motif search is a 
powerful concept for the discovery and quick supply of new enzymes and isoenzymes from any plant or other 
eukaryotic organisms. Identification and manual verification of the sequences of 28 HRP isoenzymes do not only 
contribute a set of peroxidases for industrial, biological and biomedical applications, but also provide valuable 
information on the reliability of the approach in identifying and characterizing a large group of isoenzymes. 
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Background 

Horseradish peroxidases (HRPs) originating from the 
perennial herb Armoracia rusticana (Brassicaceae) are 
heme-containing monomeric glycoproteins belonging to 
the class III plant peroxidase subfamily [1]. These versatile 
enzymes have traditionally been utilized as reporters in 
various diagnostic assays and histochemical stainings but 
have also gained increasing interest in other life science 
and biotechnological applications ranging from cancer 
therapeutics [2], protein engineering [3] and transgenics 
[1] to bioremediation [4], biosensors [5] and biocatalysis 
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[6], The in vivo functions of HRPs have not been fully 
elucidated owing to the estimated large number of 
isoenzymes (up to 42) [7], but are known to be very 
diverse [8] thus offering a wide range of substrate 
specificities and applications. Although HRP has been 
studied for decades and in spite of the large diversity of 
this enzyme family, protein engineering and heterologous 
expression have mainly been focused on one single 
isoenzyme CIA, thus neglecting the potential of all 
others. This was largely due to the lack of sequence 
information and the low efficiency of HRP expression 
in heterologous hosts. Commercial preparations are 
still extracted from the roots of A. rusticana and 
therefore consist of a mixture of various isoenzymes. 
The quality of these preparations varies greatly and 
depends on several biotic and abiotic factors, such as 
seasonal change or origin. Chromatographic purification 
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is needed to isolate highly enriched isoenzyme fractions. 
Only very few purified isoenzymes were accessible so far 
and their substrate specificities and enzymatic properties 
are poorly characterized. 

The sequences of only eight isoenzymes are currently 
known: six nucleotide sequences (CIA, C1B, C1C, C2, 
C3, N) have previously been published [9-11] and two 
amino acid sequences A2 (P80679) [12] and E5 (P59121) 
[13] have been determined. However, for decades, HRP 
has been known as a large group of enzymes with versatile 
physical properties. Already in 1966, Shannon et al. [14] 
described the physical properties of seven isoenzymes. 
Aibara et al. [15,16] characterized five neutral isoenzymes 
(Bl, B2, B3, CI and C2) and six basic isoenzymes (E1-E6) 
in 1982 and 1981, respectively. A total number of 42 
peroxidase isoenzymes have been identified by isoelectric 
focusing in commercially available HRP preparations [7], 
without knowing whether these isoenzymes differ in 
amino acid sequence or due to different post-translational 
modifications. 

This study has two major goals. First, we resolve the 
transcriptome sequence of A. rusticana, a perennial plant 
of industrial and medical importance. The availability of a 
large expressed sequence tag (EST) collection is crucial to 
support annotation in possible future A. rusticana genome 
sequencing projects. Secondly, we demonstrate the use of 
an efficient enzyme discovery pipeline including new 
generation cDNA sequencing technologies, in silico 
isoenzyme discovery and experimental sequence verifica- 
tion, gene synthesis and enzyme production and secre- 
tion by Pichia pastoris (Komagataella pastoris) as a 
straightforward approach to discover and characterize 
new isoenzymes from plants or other eukaryotes. 
Transcriptomes deliver all sequences of expressed 
genes, at the same time avoiding sequencing introns 
and providing information about all expressed exons 
and alternative exon junctions. Studies in Arabidopsis 
thaliana, a model species of the Brassicaceae family, 
have demonstrated the power of massively parallel 
transcriptome sequencing in providing high-quality 
representation of transcripts needed for gene discovery 
[17]. Similar transcriptome sequencing approaches 
have previously been applied, for example, in marker 
development, population genomics [18] and predictions of 
biosynthetic pathways [19-23]. Over the course of the 
study, the new 'third-generation' and 'fourth-generation' 
sequencing techniques have revolutionized the speed and 
cost of the transcriptome sequencing projects [24]. 
Hundreds of transcriptomes have been sequenced and 
annotated, especially by the large sequencing projects 1KP 
[25-27], PhytoMetaSyn [28,29] and Medicinal Plant 
Genomics Resource [30,31]. However, the concept of 
novel isoenzyme discovery from the large bulk of 
sequences generated by next generation sequencing (NGS) 



technologies needs a full pipeline of efficient tools. This is 
the first study using NGS transcriptome sequencing to 
discover, discriminate and characterize large numbers of 
sequence verified isoenzymes of non-model plant origin. 
Although a similar study was recently performed to 
identify fungal cellulases [32], the method described 
utilized combined secretome and transcriptome analyses 
and was only aiming to show cellulose activity of the 
cloned cDNA without the need of the full verified 
sequences to make all discovered isoenzymes available by 
recombinant expression. The method described in this 
study can be widely applied for the replenishment of the 
sequence data in any eukaryotic organism including fungi, 
plants and animal cell lines or tissues when detailed 
sequence and gene structure information of enzymes and 
isoenzymes is needed. 

Results 

Sample preparation and cDNA library generation 

The high quality (RNA integrity number RIN 9.4-9.8) of 
the RNA samples was confirmed with an Agilent 2100 
bioanalyzer. In order to include the majority of all 
encoded isoenzymes, a mixture of RNA from diverse 
plant parts, including leaves, roots, sprouts and stems, 
was chosen for mRNA isolation. cDNA synthesis, 
normalization, size selection and cloning was performed by 
LGC Genomics (Berlin, Germany). The normalized cDNA 
library was subjected to quality control experiments before 
using it for 454 pyrosequencing: a cDNA fragment size of 
over 800 bp was ensured and the normalization efficiency 
was verified by sequencing 96 randomly selected clones 
(LGC Genomics). 

Sequencing and de novo assembly 

The normalized cDNA library was sequenced on half 
a picotiterplate run on the GS FLX using Roche 454/ 
Titanium chemistry. A total of 592507 sequence reads 
with an average read length of 353 ± 122 nucleotides 
were obtained. A total of 12798 (2.16%) clonal reads 
(exact, 3' or 5') were detected. Prior to assembly, the 
sequence reads were screened for the linker sequence 
used for concatenation, the linker sequences were clipped 
and the reads were quality checked (LGC Genomics). The 
resulting 556269 reads with an average length of 343 bp 
(Figure 1A) were further filtered by Newbler sequence 
filtering to ensure consistent high quality of the reads used 
in the assembly. 490285 reads were aligned to individual 
transcripts using Newbler version 2.5.3 with default 
settings. The de novo assembly generated 18511 contigs 
with an average length of 718 bp (Figure IB) and an 
average coverage of 11.4-fold (Figure 1C). The contigs 
were further processed to 14871 isotigs with an average 
length of 1133 bp (Figure ID). 35950 reads were left as 
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Figure 1 Overview of the sequencing and assembly of the A. rusticana transcriptome. (A) Size distribution of the quality-filtered reads. Total 
number of reads: 556269, average/median length 342.9/379.0 (B) Length distribution of the 1851 1 contigs. Average/median length of the contigs: 
717.7/667.0 (C) Coverage distribution of the 1851 1 contigs. Average/median coverage of the contigs: 1 1.4/7.8 (D) Length distribution of thel 4871 
isotigs. Average/median length of the isotigs: 1133.3/1113.0. 



singletons. A detailed summary of the alignment and 
assembly process is described in Table 1. 

Plant peroxidase search and manual validation of 
assembled transcripts 

The position-specific scoring matrix (PSSM) corresponding 
to known horseradish peroxidases (cd00693) was used 
in a tblastn search in the assembled transcripts. The 
settings allowed for a very permissive filtering of putative 
peroxidases, thus including many false positives, but 
avoiding the loss of valuable data for further analyses. This 
search yielded hits in 91 transcripts, which were classified 
in secretory peroxidases, ascorbate peroxidases, glutathione 
peroxidases and peroxidase-like proteins by definition of 
the Conserved Domains Database (CDD, NCBI [33]). 
All previously known HRPs were classified as secretory 
peroxidases, so only the contigs comprising a secretory 
peroxidase conserved domain were kept. The horseradish 
transcriptome contigs of the 18 resulting secretory 
peroxidases were manually reviewed. In this process, 



the coding sequences of four contigs were extended 
with available assembly data, three contigs were split 
because of strongly conflicting reads, and two more 
contigs were discarded because of only a partial domain 
match that could not be resolved into a full-length 
sequence. In total, 20 HRP genes, with allelic variants 
corresponding to 28 peroxidase isoenzymes, were identi- 
fied in the transcriptome of A. rusticana. This includes 
isoenzymes CIA and C3 which could be partially retrieved 
from the raw reads although they did not form a full- 
length contig. No read - even partially - corresponding to 
the previously published "neutral isoenzyme" N (Q42517) 
[11] was found. 

Sanger sequencing and genome walking 

Sequences yielded by the transcriptome assemblies are not 
necessarily error free but can include incorrect information 
either caused by the transcription and RNA editing 
machineries of the plant [34,35], introduced in the 
sequencing process or resulting from misassemblies. 
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Table 1 Summary of the transcriptome sequencing, 
assembly and enzyme discovery results 

Number of 
sequences 



Transcriptome sequencing and assembly 

Total number of reads 
Clipped reads 

Reads after Newbler quality control 

Reads aligned 

Reads assembled 

Reads partially assembled 

Singletons 

Contigs 

sogroups 

sotigs 

Average isotig length 

Largest isotig length 

Average contig coverage 

HRP isoenzyme discovery and verification 

# of isotigs with a secretory peroxidase domain 

# of full length peroxidase genes 

# of peroxidase genes after manual revision 
Total # of isoenzymes (including allelic variants) 
Successfully verified from gDNA 

Enzyme production in Pichia pastoris 

Synthetic genes for production 

Successful production of an active isoenzyme 



592,507 
556,269 
543,439 
490,285 
433,179 
57,064 
35,950 
18,511 
10,619 
14,871 
1,133 
3,659 
11.4 



20 
28 
26 

26 
22 



% of 
total 



100.0 
90.2 
88.4 
10.5 
7.3 



100 
39.7 



75.? 



reads could not be found in the genomic DNA sequenced. 
The sequence of the previously published "neutral 
isoenzyme" N (Q42517) not present in the transcriptome 
sequences could not be amplified from genomic DNA 
(gDNA) either. Therefore, it was not included in the 
following analyses or experiments. The sequences of the 
transcript # 22489 (see Table 2) could not be verified. 

GC content and codon usage 

The average GC content of all 14871 isotigs was calculated 
to be 42.7% (range 28%-62%) (Figure 2), almost identical to 
the average GC content, 42.5%, reported for A. thaliana 
[36]. This is lower than the average GC content of 45.1% 
reported by the Codon Usage Database (CUD) [37], 
suggesting that the small set of available A. rusticana genes 
(14 coding sequences) used in CUD for the calculation is 
not representative for the whole species. The GC content 
of the HRP isoenzymes was observed to vary between 
42.9% (C2, contig #04627) and 51.0% (contig #22489), with 
an average GC content of 47.1%. The results of the analysis 
are described in more detail in Table 3. 

The codon usages of the HRP genes and the previously 
known A. thaliana peroxidases were compared in the 
form of heatmaps (Additional file 1), depicting the fold 
change of the codon usage frequencies compared to 
the expected (1/64) frequency (ARSCU). The clustering of 
the isoenzymes according to their codon usage frequencies 
situated newly discovered isoenzymes with most divergent 
sequence and gene structure (HRPJS523, HRP_5508, 
HRPJ22489, HRP_17517, HRP_23190) also furthest away 
from the previously known group C isoenzymes. 



The sequences of all peroxidase genes detected in the 
isotigs were verified on genome level by Sanger sequencing 
of amplified genomic DNA. In addition, the sequences of 
the isoenzymes CIA and C3 available in the databases and 
partially also found in the raw reads were revised. In case 
of five full-length contigs where no sufficient read 
data from untranslated regions was available to enable 
amplification and sequencing of the complete gene, a 
genome walking approach was successfully performed in 
order to verify the 5' and/or 3' regions of the respective 
gene. Divergent coding sequence information was 
observed for ten genes in the form of possible allelic 
variants. PCR artifacts were ruled out by repeated 
experiments to increase the coverage of the positions. 
Thus, conflicting sequence information was postulated 
not to be due to sequencing errors, but rather due to 
the high sequence similarity of the HRP isoenzymes 
and the permissive settings used in the assembly. 
Supporting this assumption, putative allelic sequences 
could be found as separate raw reads. For altogether nine 
positions in contigs 22684, 6117, 17517 and 23190 
(Table 2) the nucleotide present in the transcriptome 



Gene structure and phylogenetic analyses 

Phylogenetic relationships of the HRP isoenzymes are 
shown in Figure 3 and Additional file 1. Interestingly, the 
previously known isoenzymes seem to be closely related 
to each other, while most of the new isoenzymes discov- 
ered in the transcriptome seem to share higher evolution- 
ary distance to them. From the 20 peroxidase gene loci, 15 
were confirmed to have three introns by comparing either 
transcript data or protein sequence data to the verified 
gDNA sequence. Further four genes (5508, 22489, 17517, 
23190) were noted to have only two introns and one gene 
(3523) no introns (Table 3). The number of introns 
correlates with the evolutionary distance so that genes 
having aberrant intron numbers were situated in separate 
branches close to each other in the phylogenetic topology. 
With the information obtained from the reads, no alterna- 
tive splicing could be shown. According to SignalP, all of 
the isoenzymes have an N-terminal signal sequence vary- 
ing in length from 18 amino acids to 31 amino acids. The 
lengths of the signal sequences are described in Table 3, 
and an alignment of the amino acid sequences of the HRP 
isoenzymes is shown in Additional file 2. 



Naatsaari et al. BMC Genomics 2014, 15:227 
http://www.biomedcentral.com/1471-2164/15/227 



Page 5 of 16 



Table 2 Comparison of the HRP isoenzyme sequences between GenBank, UniProt, transcriptome and verified 
genome sequences 



HRP 



Nt 



Sanger sequence 

aa (exon)/intron 



Transcriptome sequence 
nt aa 



GenBank sequence 
nt aa (exon)/intron 



UniProt sequence 
aa 



CIA 



C1B 



C1C 



C2 



C3 



A2 



E5 



01805 
22684 



01350 



TA 
109-110 
C1 1 59 
T/C253 
T/C859 

ss 
ntl-60 
C178 
A/T1335 
A/G1888 

C1889 
C/G1921 

CT 
1250-1251 
A1334 
G/T1294 
A/T1323 
T/C1484 
GT1541 

ss 
nt1-93 
AAT 
231-234 

GGA 
996-998 

AAT 
999-1001 

ACG 
1185-1187 
G/A1203 

AAT 
1335-1337 
SS 

ntl-81 
T419 
C422 
C545 
None 
G1611 
TGA 
1627-1629 
None 



Y37 

intron 
intron 
intron 
ss 

aa1-20 
R60 
intron 

T/A165 
A165 

Q/E176 
intron 

intron 
intron 
intron 
L231 
F250 
SS 

aa1-31 
N78 

G220 

N221 

T284 

A/T290 
N334 

SS 

aa1-27 
L82 
D83 
C124 
none 
R337 
D343 



ss 
nt1-60 
C178 

A/G493 
C/T1 889 
C/G526 



SS 

1-93 

AAT 
231-234 

GGA 
661-663 

AAT 
664-666 

ACG 
850-852 

G868 

AAT 
999-1002 
SS 

ntl-81 
C/T246 
T/C249 
T/C372 

none 
A1010 

CGG 
1026-1028 

none 



intron 
intron 
intron 

ss 
aa1-20 
R60 
intron 
T/A165 
A/V165 
Q/E176 
intron 

intron 
intron 
intron 



ss 
aal -31 
N47 

G220 

N221 

T284 

A290 
N334 

SS 

aa1-27 
L82 
D83 
C124 
none 
K337 
G343 



AT 
109-110 
G1159 
T253 
C859 



Al 18 



G433 



G466 



G1294 
A1323 
T1484 
C1541 



13/ 

intron 
intron 
intron 

S40 

A145 

E156 
intron 

intron 

ntron 

intron 

L231 

F250 



Y37 



S40 



A145 



E156 



L231 
F250 



D47 

N189 

G190 

L253 

A259 
D303 



Lbb 
D56 
C97 
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Table 2 Comparison of the HRP isoenzyme sequences between GenBank, UniProt, transcriptome and verified 
genome sequences (Continued) 



HRP 


Sanger sequence 


Transcriptome sequence 


GenBank sequence 


UniProt sequence 




Nt 


aa (exon)/intron 


nt 


aa 


nt aa (exon)/intron 


aa 


02021 


None 


none 


none 


none 


- 


- 


03523 


None 


none 


none 


none 


- 


- 


06117 


T30 


V10 


C/T30 


VI 0 


- 


- 




C1088 


I269 


T807 


1269 


- 


- 


17517 


T190 


Y64 


C190 


H64 


- 


- 




C1 1 57 


G282 


T846 


G282 


- 


- 




A1232 


K307 


G921 


K307 


- 


- 


08562.1 


None 


none 


none 


none 


- 


- 


08562.4 


None 


none 


none 


none 


- 


- 


23190 


T1345 


SI 09 


G1345 


S109 


- 


- 




C1423 


G135 


T1423 


G135 








T1842 


S222 


T/C1842 


S/P222 








C1850 


T224 


A/C1850 


T224 








A2221 


E348 


T/A2221 


V/E348 






04663 


None 


none 


none 


none 






06351 


None 


none 


none 


none 






05508 


G/A346 


A/T1 16 


G/A346 


A/T116 






22489 






G/A597 
G/T715 


T199 
A/S239 







All nucleotide (nt) and amino acid (aa) positions were calculated from the start ATG. Variations between the nt positions of the transcriptome sequence compared 
to other sequence sources are either due to deletions or intronic sequences in the other sources. "-" indicates that no sequence information is available from the 
respective source, "ss" indicates a putative signal sequence. Deletions or missing sequences are marked with "*". "N/NX" indicates a variation at position X. "None" 
indicates that no polymorphisms or differences between transcriptome and genome sequence were found. The isoenzymes CIA and C3 were detected in the 
transcriptome raw reads with only partial/low coverage (0-2x), thus no consensus sequence was formed. 



Heterologous protein production in Pichia pastoris 

Twenty-six HRP sequences including allelic variants were 
codon optimized for P. pastoris expression and ordered as 
synthetic fragments. Twenty-two of them showed activity 
with at least one of the substrates tested, thus verifying 
a successful expression in P. pastoris. The peroxidase 
activities of the produced isoenzymes were detected with 




GC % 

Figure 2 GC content distribution of the A. rusticana isotigs. GC 

content distribution of the A. rusticana isotigs varies from 28% (min) to 
62% (max) with a range of 35%. The average GC content of all isotigs 
is 42.72%. Mode (x-axis value) 43, mode value (y-axis value) 2376. 



four substrates having variable assay pH optima. The 
substrate or pH-specific performance (Table 4) of the 
isoenzymes suggests a wide range of possible applications 
for this versatile group of peroxidases. 

Discussion 

Although high throughput sequencing technologies and 
bioinformatics tools to handle the enormous amount of 
data generated have been rapidly developing in the recent 
past, the expressed sequence data of many organisms of 
wide importance are still not available. Our study demon- 
strates how NGS technologies can provide a rapid, low- 
cost basis for the discovery of isoenzymes required 
for specific industrial, medical or biological applications. 
Below we discuss the reliability of the approach in identify- 
ing and characterizing an important group of isoenzymes, 
challenges provided by the library generation, sequencing 
and assembly methods, and suitability of the data obtained 
from the pipeline for heterologous protein production 
without laborious manual verification of the sequences. 

A normalized cDNA collection originating from 
multiple A. rusticana tissues was sequenced with 454 
pyrosequencing technology. In comparison to the other 
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Table 3 Summary of the horseradish peroxidase isoenzymes and associated data produced during this study 



Contig 
number 


Name 
(* novel) 


Length 
nt 


GC 
content% 


Accession 
# 


CAI 
calculated 


Intron 

# 


Signal sequence 
length 


Disulfide bridges 
(EDBCP prediction) 


pi mature 
protein 




C\ A 
L I A 


I Uoz 


43 .oy 


HtyojoUU 


U.o I z 


3 


3U 


n 1 011 [aa /ioi ro7 3mi n 77 inoi 

Li i— yij i_44-4yj Ly/— juij \_\//— zuyj 


r rQ 

D.jy 


15901 


C1B 


1056 


43.94 


HE963801 


0.808 


3 


28 


[11-91] [44-49] [97-301] [177-209] 


5.84 


25148 


C1C* 


1059 


45.14 


HE963802 


0.809 


3 


29 


[11-91] [44-49] [97-301] [177-209] 


6.49 


25 1 48_2 


CI D* 


1059 


45.04 


HE963803 


0.810 


3 


29 


[11-91] [44-49] [97-301] [177-209] 


7.04 


04627 


C2 


1044 


42.91 


HE963804 


0.800 


3 


24 


[11-91] [44-49] [97-301] [177-209] 


8.56 


- 


C3 


1050 


46.76 


HE963805 


0.781 


3 


29 


[11-91] [44-49] [97-300] [177-209] 


/./I 


Manual 
assembly 


A2A* 


1011 


46.79 


HE963806 


0.761 


3 


31 


[11-91] [44-49] [97-299] [176-208] 


4.93 


Manual 
assembly 


Azd 


1 m 1 
I U I I 


4o.oy 


I— IPQA3Qr~17 


U./o I 


1 

z> 


3 1 


T1 1 G11 [AA /IOI T07 TOOT T1 7A TflQl 

Li i— yij [44-4yj |y/— zyyj [\/o— zuoj 


4.y^ 


U4jOz 


t J 


1 DA A 
I U44 


A& CM 
4D.U/ 


ntyOjoUo 


n 771 


J 


z/ 


n 1 qii \aa /iqi ro7 ^nm n 77 inai 
[i i — y i j |_44-4yj [y/— juuj [\//— zuyj 


O.C34 


01805 


1805* 


1065 


44.41 


HE963809 


0.797 


3 


31 


[11-91] [44-49] [97-301] [177-209] 


5.97 


22684 


22684.1* 


1050 


46.76 


HE963810 


0.770 


3 


29 


[11-91] [44-49] [97-300] [177-209] 


6.98 


22684_2 


22684.2* 


1050 


46.67 


HE96381 1 


0.772 


3 


29 


[11-91] [44-49] [97-300] [177-209] 


6.37 


01350 


1350* 


975 


50.97 


HE963812 


0.707 


3 


28 


[11-91] [44-49] [97-292] [176-201] 


8.67 


02021 


2021* 


996 


46.08 


HE963813 


0.788 


3 


29 


[11-89] [44-49] [95-297] 


9.46 


23190 


23190.1* 


1080 


49.26 


HE963817 


0.724 


2 


31 
42 


[1 1-92] [44-49] [98-293] [178-205] or 
[22-103] [55-60] [109-304] [189-216] 


8.40 
6.58 


23 1 90_2 


23190.2* 


1080 


49.17 


HE963817 


0.722 


2 


31 
42 


[11-92] [44-49] [98-293] [178-205] or 
[22-103] [55-60] [109-304] [189-216] 


8.60 
7.09 


04663 


4663* 


1077 


47.82 


HE963814 


0.748 


3 


31 


[11-91] [44-49] [97-299] [176-208] 


4.48 


06351 


6351* 


945 


43.39 


HE963816 


0.786 


3 


18 


[17-96] [50-55] [102-292] [180-206] 


6.37 


03523 


3523* 


960 


44.58 


HE963820 


0.761 


0 


22 


[11-92] [44-49] [98-293] [177-203] 


8.99 


05508 


5508.1* 


966 


49.28 


HE963815 


0.735 


2 


24 
30 


[11-87] [44-49] [93-287] [171-198] or 
[1 7-93] [50-55] [99-293] [1 77-204] 


8.49 
8.47 


05508_2 


5508.2* 


966 


49.38 


HE963815 


0.735 


2 


24 
30 


[11-87] [44-49] [93-287] [171-198] or 
[l 7-93] [50-55] L99-293J [I 77-204] 


8.49 
8.47 


22489J 


22489.1* 


978 


51.02 


HE963818 


0.726 


2 


23 
34 


[22-98] [55-60] [1 04-298] [1 82-209] or 
[H-87J L44-49J [93-287] [171-198] 


8.93 
8.51 


22489_2 


22489.2* 


978 


50.82 


HE963819 


0.727 


2 


23 
34 


[22-98] [55-60] [1 04-298] [1 82-209] or 
[11-87] [44-49] [93-287] [171-198] 


8.93 
8.51 


06117 


6117* 


1008 


47.02 


HE963821 


0.802 


3 


22 
jZ 


[11-91] [44-49] [97-298] [176-208] or 

m imi r^/i coi nn7 ^noi tiqa hqi 
[zl — IUIJ p4-jyj LIU/— duo] Lloo-zloJ 


5.52 

O. 1 D 


17517J 


17517.1* 


972 


48.46 


HE963822 


0.737 


2 


23 
24 


[12-88] [45-50] [94-296] [171-203] or 
[11-87] [44-49] [93-295] [170-202] 


9.49 
9.39 


17517_2 


17517.2* 


972 


48.56 


HE963823 


0.739 


2 


23 
24 


[12-88] [45-50] [94-296] [171-203] or 
[11-87] [44-49] [93-295] [170-202] 


9.52 
9.41 


08562_1 


08562.1* 


996 


47.09 


HE963824 


0.779 


3 


22 
28 


[11-91] [44-49] [97-298] [176-208] or 
[17-97] [50-55] [103-304] [182-214] 


9.01 
9.03 


08562_4 


08562.2* 


996 


47.79 


HE963825 


0.788 


3 


22 
28 


[1 1-91] [44-49] [97-298] [176-208] or 
[17-97] [50-55] [103-304] [182-214] 


9.00 
9.02 



The nucleotide sequences of 28 isoenzymes were submitted to EMBL. "*" indicates novel. "CAI" = codon adaptation index. If signal sequence predictions gave 
more than one alternative result, both signal sequence lengths are shown, separated by "/". Disulfide bridges were predicted using both alternatives of the 
mature protein (EDBCP = Ensemble-based Disulfide Bonding Connectivity Pattern prediction server}. 



NGS methods, 454 pyrosequencing produces long reads. 
This is of advantage when sequencing genes coding for 
isoenzymes and identifying exonic variants. Longer reads 
increase the probability to uniquely align a given read, 
which might be problematic with the short reads 



produced by other NGS methods [38]. Studies in 
A. thaliana suggested that with the high coverage attain- 
able by massively parallel sequencing, all transcripts can 
be well represented in the sequence data regardless of expres- 
sion levels [17]. However, the benefits of normalization in 
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Figure 3 Cladogram of all isoenzymes known and discovered during this study. The dendrogram was cut at a branch length of 0.21 and 
the resulting sub-trees were colored. All previously described isoenzymes are located in the red and light green trees, respectively; whereas all 
novel isoenzymes (except for 01805, 22684.1, 22684.2, and 04663) cluster in distinct sub-trees (black, blue, dark green and orange) indicating a 
larger sequence diversity. 



non-model species have not been well characterized, 
and a recent study by Cirulli et al. reports low identi- 
fication rates of exonic SNVs (single nucleotide varia- 
tions) in non-normalized transcriptome, if the genes 
of interest are not well-expressed in the source tissue 
[38]. Since also the genome and transcriptome sizes 
of A. rusticana and the sequence data required to 
reach also isoenzymes with low expression levels are 
unknown, the cDNA library sequenced in this study 
was normalized. Normalization of the cDNA has been 
reported to be especially important in gene discovery 
when the cDNA used for sequencing is pooled from 
many tissues or individuals [39-41], and to considerably 
reduce the frequency of abundant transcripts thus 
increasing the possibility to reach also unique 
transcripts of isoenzymes with low expression levels. 
Half a plate run on a GS FLX platform resulted in 
over 500000 high-quality reads, corresponding to a 
relatively high average coverage (>10x) of the assembled 
14871 isotigs with a high average length of 1133 bp. 
Comparable to previous transcriptome sequencing 
studies [40,41], 88% of the reads could be assembled 
into contigs. Many of the remaining singletons were 
of high quality and also represented an important 
source of information [18]. Singletons could either result 
from 454 sequencing errors or contaminants from plant 
parasites [42], they could be caused by over-efficient 
normalization methods, or simply represent rare 
transcripts with thin coverage despite the normalized 
cDNA pool used for sequencing. 



The 454 pyrosequencing technique generates relatively 
long reads including very few technical errors (mainly 
related to homopolymer runs), and is therefore well-suited 
for applications such as de novo transcriptome sequencing. 
Although the sequence length achieved by 454 Titanium 
FLX platform is still clearly shorter than by traditional 
Sanger sequencing, it has been reported to be adequate for 
reconstructing full length transcripts [17] and validated to 
be comparable in accuracy to Sanger sequencing [43,44]. 
The high average isotig length of 1133 bp and the assembly 
of 90% (18/20) full length peroxidase genes reached in this 
study supports the statement of adequate read length and 
coverage to detect complete transcripts. Only isoenzymes 
CIA and C3 were not assembled into a contig due to low 
sequence coverage. 

Although the error rates associated with NGS methods 
have been reported to be low, they could still cause 
problems in reliable sequence polymorphism detection. 
The requirement of >90% match used in this study, 
combined with a minimal match length of 40 bp was 
expected to provide a very high number of contigs without 
collapsing and joining similar isoenzymes into a single 
contig [18]. However, the isoenzyme sequences were 
known to be partially almost identical. To validate the 
assembly and ORF (open reading frame) prediction 
correctness, and the existence of allelic variants, the 
isoenzyme sequences were amplified from genomic DNA. 
The combination of transcriptome sequencing and Sanger 
sequencing of amplified genomic DNA revealed 38 
variable positions in the coding sequences of 11 genes. For 
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Table 4 Summary of isoenzyme expression and characterization 



Contig number 


Name (* previously unknown) 


Activity ABTS 


Activity TMB 


Activity guaiacol 


Activity pyrogallol 


- 


C1A 


+ 


+ 


+ 


+ 


15901 


C1B 


_ 


+ 


_ 


_ 


25148 


C1C* 


+ 


+ 


(+) 


+ 


25 1 48_2 


C1D* 


+ 


+ 


(+) 


+ 


04627 


C2 


+ 


+ 


+ 


+ 


_ 


C3 


+ 


+ 


+ 


+ 


Manual assembly 


A2A* 


+ 


+ 


+ 


+ 


Manual assembly 


A2B* 


+ 


+ 


(+) 


(+) 


04382 


E5 


+ 


+ 


+ 


+ 


01805 


01805* (B1?) 


+ 


+ 


- 


- 


22684 


22684.1* (B2A?) 


+ 


+ 


_ 




22684_2 


22684.2* (B2B?) 


+ 


+ 


_ 


_ 


01350 


01350* 


+ 


+ 


_ 




02021 


02021* 


- 


- 


- 


- 


23190 


23190.1* 


- 


- 


- 


- 


23190_2 


23190.2* 


n.d. 


n.d. 


n.d. 


n.d. 


04663 


04663.1* 


+ 


(+) 


- 


- 


06351 


06351* 


+ 


+ 


+ 


+ 


03523 


03523* 


- 


- 


- 


- 


05508 


05508.1* 


+ 


+ 


+ 


+ 


05508_2 


05508.2* 


n.d. 


n.d. 


n.d. 


n.d. 


22489J 


22489.1* 


+ 


+ 


(+) 


+ 


22489_2 


22489.2* 


+ 


+ 


(+) 


+ 


06117 


06117* 










17517J 


17517.1* 


+ 








17517_2 


17517.2* 


+ 


+ 


+ 


+ 


08562J 


08562.1* 


+ 


+ 






08562_4 


08562.2* 


+ 


+ 




+ 



Isoenzymes showing obvious peroxidase activity with the assay used are marked with "+". Isoenzymes showing very low but detectable peroxidase activity with 
the assay used are marked with "(+)". Isoenzymes with no activity detected during an observation period of 2 h are marked with Allelic variants not produced 
heterologously are marked with n.d. (no data available). Isoenzymes discovered during this study are marked with "*". 



nine positions thereof (in four transcripts) the correspond- 
ing nucleotide could not be found in the genomic DNA. 
Manual confirmation of the transcriptome reads revealed 
that the coverage of all positions was high and that the 
reads agreed. Closer analysis of the positions also ruled 
out false variations caused by too high coverage. At 
relatively low or moderate levels of coverage, sequencing 
mistakes are not pushed over. Too high sequencing depth 
increases noise and could have introduced, without very 
strict quality control, false variations [38]. Therefore, 
sequencing errors could be excluded as the source of 
the differences. The differences could be caused either 
by mismatches introduced by the reverse transcriptase 
enzyme during library generation, as a result of RNA 
editing events [34,35] or reflect the natural variation 
between individual plants and plant parts. The general 



error rate of the sequencing technique was noted to be 
very low, but isoenzymes coded by less common allelic 
variants would have been missed without manual 
sequence verification. Twenty-six of the resulting 
coding sequences, including allelic variants missed in 
the transcriptome sequencing and assembly processes, 
were codon optimized, synthesized and transformed 
to P. pastoris for expression. 

This study reveals that for the coverage of all isoenzymes 
including allelic variants represented by the cDNA library 
sequenced, manual work to verify the resulting transcript 
sequences cannot be avoided. However, the allelic variants 
represent only a minor part of the newly discovered 
enzymes. The quality of the sequences is very high 
and differences to genomic DNA minimal, confirming 
that the enzyme discovery method described in this 



Naatsaari et al. BMC Genomics 2014, 15:227 
http://www.biomedcentral.com/1471 -21 64/1 5/227 



Page 10 of 16 



study for high-throughput applications would not 
necessarily require manual verification of the sequencing 
by laborious Sanger sequencing of amplified genomic 
DNA. However, manual curation of the contigs of interest 
and splitting of the data in contigs with clear assembly 
conflicts can be done and could be worthwhile, as 
especially the additive effects of amino acid changes 
in collapsed contigs could cause problematic changes 
in the enzyme structure. The primary enzyme discovery 
pipeline utilized in this study provides a functional 
approach to find proteins of interest for heterologous 
production, giving an example of an affordable stan- 
dardized sequencing project. Despite the large amount 
of useful data produced by the NGS approach, our 
study showed that sequence confirmation and data 
validation should not be neglected. 

For the heterologous secretory production of single 
isoenzymes in P. pastoris, the codon usages of the 
coding sequences were optimized for efficient translation, 
and fragments corresponding to the predicted mature 
isoenzymes were produced synthetically. When the 
signal peptide prediction with SignalP led to two alternative 
signal peptide junctions, the mature peptides corresponding 
to the longer signal peptide variants were ordered as 
synthetic fragments, and the signal peptide variants 
with shorter mature peptide were successfully amplified 
via PCR. Correct cloning of all genes into P. pastoris 
expression vectors was verified by Sanger sequencing. 
Since the used P. pastoris expression vector already 
contains the signal sequence of the S. cerevisiae mating 
factor a, the isoenzymes were produced without the 
predicted natural signal sequences. Twenty-two isoenzymes 
showed peroxidase activity with at least one of the 
substrates used. All activities were measured with four 
different assays over a pH range from 4.5 to 7. Interestingly, 
for each assay (Table 4) a different isoenzyme showed 
the highest activity thus suggesting variable substrate 
specificities or pH optima. This observation further 
emphasizes the importance of the availability of a 
large group of individually produced pure isoenzymes to 
be able to comprehensively respond to the need of variable 
performance parameters including substrate specificity, 
activity, stability, and operating pH optimum. Four of 
the isoenzymes did not show peroxidase activity with the 
assays used. This could either be due to very low yields of 
active enzyme, totally inactive enzyme or unsuitable assay 
conditions. 

The isoenzyme CIA was reported to be the most 
abundant isoenzyme in A. rusticana [1] and was thus 
expected to be found in the transcriptome. However, 
only two raw reads covering a minor part of the coding 
sequence could be detected. This might either suggest 
over-normalization of the cDNA library decreasing the 
total counts of the putatively most abundant transcripts to 



almost zero [45], or happen due to naturally occurring gen- 
etic variation with phenotypic correlation to adaptations to 
natural environments ranging from pathogens, light 
conditions or abiotic stress to a variety of other environ- 
mental perturbations [8,46,47]. Although mRNA originating 
from all available plant parts was used to reach genes 
activated at diverse stages of A. rusticana growth, 
some developmental stages were not present and the 
absence of certain isoenzymes due to missing tissues 
cannot be ruled out. This finding might illustrate the 
high variance of HRP expression in A. rusticana plants 
and consequently the variance in the commercial HRP 
preparations, thus underlining the clear need for a reliable 
heterologous expression system that enables a consistent 
isoenzyme quality. 

Peroxidase isoenzymes have been suggested to have 
multiple roles in the plant and thus also be variedly 
expressed depending on both biotic and abiotic factors 
[8]. To roughly estimate the expression levels of the 
newly discovered peroxidase genes, their GC contents 
and codon usages were assessed. Genes that are highly 
expressed have been suggested to possess a higher GC 
content and a more biased codon usage than genes with 
low expression levels [48]. A majority of both eukaryotic 
and prokaryotic species with large population sizes have 
been reported to have non-random codon usage mainly 
due to Darwinian selection between synonymous codons 
[49-51]. Highly expressed genes have been reported to use 
a restricted set of codons to ensure optimal translational 
efficiency [52,53]. In addition to gene expression levels, 
GC content has also been connected to gene regulation 
[54-57] and correlated with genomic features including 
methylation pattern [58], short intron length [59] and 
gene density [60] thus suggesting possible functional 
relevance. The codon usages and GC contents calculated 
using the verified coding sequences of the isoenzymes are 
described in Table 3. As expected, large variation between 
isoenzymes exist. These findings could suggest a spatial 
and temporal distribution of the isoenzymes in cellular 
processes [61]. 

Phylogenetic relationships of the HRP isoenzymes 
are shown in Figure 3. Whereas the previously known 
isoenzymes are closely related to each other, most of 
the new isoenzymes discovered in the transcriptome 
share higher evolutionary distance to the previously 
known HRPs. BLASTX analysis to the peroxidases of 
A. thaliana (Additional file 3) revealed that the 
A. rusticana peroxidases share 81% to 95% sequence 
similarity to the most similar isoenzyme of A. thaliana. 
Evolutionary distance does not necessarily correlate 
with altered substrate specificity, specific activity or 
optimal reaction conditions, but the discovery of new 
evolutionary branches with higher structural diversity 
does offer optimal conditions for the generation of an 
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enzyme assortment with diverse properties for a wide 
variety of biomedical and industrial applications. 

A combination of cDNA sequencing and gDNA 
verification in this study also provided valuable information 
of the intron-exon boundaries of the HRP genes. The 
number of exons in the isoenzymes was noted to vary 
from one to four, corresponding to zero to three introns. A 
large majority (75% of the peroxidase loci) of the 
isoenzymes were found to have four exons and three 
introns. Intron numbers have been reported to be 
highly conserved, but total intron length (total sum of the 
sizes of all introns within a gene) rather correlated to the 
GC-content of the gene [62]. Thus, intron number could 
be informative in terms of evolutionary origin and 
distance of the enzyme. In this study, intron numbers were 
found to correlate with the phylogenetic relationships of 
the amino acid sequences. Contigs with an unusual number 
of introns (none or two, Table 3) were situated in close 
proximity to each other furthest away from the previously 
known isoenzymes and clustered together when comparing 
the codon usage frequencies. With the information 
obtained from the reads, no alternative splicing could 
be observed. 

The well-characterized isoenzyme HRP CIA has been 
reported to have a signal peptide consisting of 30 amino 
acids, and a carboxy- terminal extension suggested to 
target the protein to the vacuoles [63]. Also other known 
isoenzymes of the group C (C1B, C1C, C2, and C3) have 
been reported to have signal peptides varying in length 
from 9 amino acids (C1C) to 29 amino acids (C3). By 
observing the alignment of all previously known and 
newly discovered isoenzymes (Additional file 2), existence 
of signal sequences also in other previously known and 
most of the new isoenzymes seemed very probable. 
According to the signal sequence prediction (SignalP) 
performed, all isoenzymes seem to have a signal sequence 
varying in length from 18 to 31 amino acids (Table 3, 
Additional file 2). Isoenzyme C1C, previously reported to 
have a signal sequence of nine amino acids, was predicted 
to have - better corresponding to the sequences of the 
very closely related isoenzymes CIA and C1B - a signal 
sequence of 29 amino acids. In the case of unclear signal 
sequence prediction with more than one option for the 
length of the signal peptide, both forms were taken into 
consideration when planning the constructs for enzyme 
production in P. pastoris. 

Conclusions 

To facilitate the possibilities for heterologous expression 
and isoenzyme characterization, we have elucidated 
the nucleotide sequences of 28 horseradish peroxidase 
isoenzymes by using the data obtained from A. rusticana 
454 transcriptome sequence analysis with manual verifica- 
tion of PCR amplified genomic DNA. Although studies 



including transcriptome analysis of non-model species 
have become increasingly popular since the emergence of 
the NGS technologies, methods for the utilization of the 
454 technology for the purpose of isoenzyme discovery in 
non-model plant species have not been established. In this 
project, transcriptome sequencing reads are further 
processed with alternative assemblies and manual sequence 
verification to determine the nucleotide sequences of all 
HRP isoenzymes. This study does not only contribute a set 
of transcripts, which can be used for marker development 
and genomic studies to understand agriculturally important 
traits in A. rusticana, but also provides valuable infor- 
mation of the peroxidase gene structure. Twenty-two 
of the verified isoenzymes have been produced in a form 
that was found active towards the tested substrates in 
P. pastoris utilizing a new P. pastoris expression platform 
[64], validating the success of the approach and providing 
first insights into the versatility of this large group of 
isoenzymes discovered. 

Methods 

Plant specimens, RNA extraction and quality analysis 

Wild horseradish (A. rusticana) roots were purchased from 
local farmers and grown in the laboratory to obtain fresh 
roots, sprouts, stems and leaves. Tissues were collected in 
aliquots, frozen in liquid nitrogen and stored at -80°C. 
Total RNA from all available plant parts was isolated using 
RNaqueous kit (Applied Biosystems/Ambion, Austin, TX, 
USA) according to the manufacturer's recommendations. 
Quality assessment to ensure RNA integrity was performed 
with Agilent 2100 Bioanalyzer (Agilent Technologies, Santa 
Clara, CA, USA). 

Normalized cDNA library construction and sequencing 

Transcriptome sequence was obtained by a commercial 
service from LGC Genomics (Berlin, Germany). The 
methods used are roughly summarized as follows: mRNA 
was purified from total RNA using mRNA- ONLY™ 
Eukaryotic mRNA Isolation Kit (Epicentre, Madison, WI, 
USA). One ug of mRNA was used for first-strand cDNA 
synthesis and amplification according to the Mint- 
Universal cDNA Synthesis Kit user manual (Evrogen, 
Moscow, Russia), followed by a normalization reaction 
using the Trimmer Kit (Evrogen). Normalized material was 
re-amplified, digested (SfiT), size-selected (>800 bp, LMP 
agarose), purified (Qiagen, Hilden, Germany) and ligated 
to pDNR-lib vector (Clontech, Saint-Germain-en-Laye, 
France) using the Fast Ligation Kit (New England Biolabs, 
Ipswich, MA, USA). The desalted ligation was used to 
transform NEBlOb competent cells (New England Biolabs). 

Roughly a million clones were plated on LB (lysogeny 
broth) + chloramphenicol (Cm) plates, scraped off the 
plates and stored as glycerol stocks at -70°C. Plasmid 
DNA was prepared using standard methods (Qiagen, 
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Hilden, Germany), and digested with Sfil. cDNA inserts 
were gel-purified (LMP-Agarose/MinElute Gel Extraction 
Kit, Qiagen) and ligated to high-molecular-weight DNA 
using a proprietary S/H-linker. 

Library generation for the 454 FLX sequencing was 
carried out according to standard protocols (Roche/454 
Life Sciences, Branford, CT 06405, USA). In short, 
the concatenated inserts were sheared randomly by 
nebulization to fragments ranging in size from 400 bp 
to 900 bp. These fragments were end polished and the 
454 A and B adaptors that are required for the emulsion 
PCR and sequencing were ligated to the ends of the 
fragments. The resulting fragment library was sequenced 
on half a picotiterplate on the GS FLX using the Roche/ 
454 Titanium chemistry. Sequence data can be accessed 
via the EMBL-EBI European Nucleotide Archive under 
the study accession number PRJEB5793. 

Assembly of the sequence reads to transcripts 

Raw reads produced by the pyrosequencing process 
were screened for the S/H-linker that was used for 
concatenation and the linker sequences were clipped 
from the reads. Poly A/T sequences were mostly 
(~90%) removed with the linker. High-quality reads were 
selected using Newbler sequence filtering at default 
settings. The clipped, quality controlled reads were assem- 
bled into individual isotigs using the Roche/454 Newbler 
software (454 Life Sciences Corporation, version 2.5.3) with 
default settings (minimum read length 20, duplicate reads 
excluded, expected depth 0, seed step 12, seed length 16, 
seed count 1, minimum overlap length 40 bp, minimum 
overlap identity 90%, alignment identity score 2, alignment 
difference score -3). 

Discovery of peroxidases in the assembled contigs 

The PSSM (position-specific scoring matrix) corre- 
sponding to known horseradish peroxidases was 
obtained from NCBI's Conserved Domain Database 
(cd00693, CDD v3.01) [65]. It was used in a tblastn 
search (e-value cutoff le-5) [66] on the assembled 
contigs to yield a preliminary set of HRP candidate 
sequences. This set was refined by filtering sequences 
whose translation mapped back to a domain different 
to the original profile PSSM in an rpsblast classification 
of the entire CDD or had a bit score lower than the 
NCBI-specified threshold for a specific domain match. 
The read composition of the refined set of contigs 
was manually reviewed using SeqMan (DNASTAR, 
Madison, Wisconsin, USA). Isotigs where two apparently 
different variants were assembled into one contig by 
the assembler were split. Protein coding regions were 
extended using read information if two or more reads 
contained the same sequence. 



Genome walking and manual verification of horseradish 
peroxidase sequences 

The sequences of the identified peroxidase genes were 
manually verified by Sanger sequencing of PCR amplified 
genomic fragments using Phusion™ High-Fidelity 
Polymerase (Finnzymes Oy, Espoo, Finland) and primers 
listed in Additional file 4. If no flanking regions were 
available for primer design, genome walking [67] was 
utilized to clarify and complete the sequences of the 
C- and N-termini. Therefore, 2 ug aliquots of genomic 
DNA were singly digested with Bspl43l, Hindlll, Psul 
(Fermentas, St. Leon-Rot, Germany), BsaWl (New England 
Biolabs GmbH, Frankfurt am Main, Germany) or Xholl 
(Promega GmbH, Madison, WI, USA) in order to get 
fragments of 1 - 5 kb size. The digestion was stopped 
by heat inactivation of the enzymes, the fragmented 
DNA was precipitated with ethanol and the pellet 
was dissolved in 30 uL of distilled water. An adaptor 
was created by annealing adaptor strand 1 (5'-GTAA 
TACGACTCACTATAGGGCACGCGTGGTCGACGGCC 
CGGGCTGGT-3') either to adaptor strand 2.a (3-TCCC 
CGACCACTAG-5') for &/?143I/ftwI-/X/zoII-digested DNA, 
2.b (3-TCCCCGACCATTAA-5') for EsaWI-digested DNA 
or 2.c (3'-TCCCCGACCATCGA-5') for Hindlll- digested 
DNA. 

In the annealing reaction, adaptor strand 1 was 
mixed in 1:1 molar ratio with adaptor strand 2.a/2.b/2.c 
(i.e. 13.7 uL of 100-uM adaptor strand 1 + 4.0 uL of 
100 uM adaptor strand 2) and heated to 95 °C for 5 min. 
To anneal the two strands to a functional adaptor 
molecule, the mixture was allowed to slowly cool 
down to room temperature. The three differently 
annealed adaptors (1 + 2a, 1 + 2b, 1 + 2c) were ligated 
for three hours at room temperature with T4 DNA 
Ligase (Fermentas) to the digested DNA fragments, 
considering the specific 5' overhangs that have been 
created by the respective restriction enzymes. The 
ligation reaction was stopped by incubation for 5 min at 
70°C and 70 uL of TE (Tris-EDTA) buffer was added. Two 
gene-specific primers and two adaptor primers were 
designed. The gene-specific primers were designed to bind 
approximately 100 bp from the end of the known 
sequence, considering that no restriction site of the restric- 
tion enzymes used for DNA digestion laid between the 
primer-binding site and the end of the known sequence. 
AdaptorPrimerl (5'-GTAATACGACTCACTATAGGGC-3') 
and GeneSpecificPrimerl were used as a primer pair 
for a first PCR with 1 uL of the DNA + adaptor 
ligation product as template DNA. One uL of the first 
PCR mix was used as template for a second PCR with 
AdaptorPrimer2 (5'-ACTATAGGGCACGCGTGGT-3') and 
GeneSpecificPrimer2. This second primer pair was designed 
to bind within the first PCR product. Both PCR steps were 
performed with an elongation time of 50 seconds. 
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A gene-specific DNA fragment as product from the 
second PCR was isolated from a preparative agarose gel, 
purified (SV DNA extraction kit, Promega) and sent to 
Sanger sequencing (LGC Genomics GmbH, Berlin, 
Germany), using AdaptorPrimer2 and the corresponding 
GeneSpecificPrimer2. If unspecific primer binding or 
nucleotide polymorphisms were suspected, the PCR 
products were cloned into the pJET 1.2blunt vector 
(Genejet cloning kit, Promega), transformed to E. coli 
ToplO F' and plasmids from single colonies were 
isolated for sequencing to ensure the read consisted 
of only one allele. 

The resulting gene sequences were submitted to EMBL 
[68] under accession numbers HE963800-HE963825 
(Table 3). 

Codon usage, GC content, isoelectric point, signal 
sequence prediction, disulfide bridge prediction and 
phylogenetic analyses of the horseradish peroxidase 
isoenzymes 

Codon usages and GC contents of the HRP isoenzymes 
were analyzed using CAIcal [69,70] and Mega5 [71,72]. 
The sequences of the A. thaliana Class III peroxidases 
were downloaded from TAIR [73]. Sequences were aligned 
with ClustaIW2 [74,75]. The theoretical isoelectric points 
(pi) were calculated with ExPASy Compute pI/Mw tool 
[76]. Disulfide bridges were predicted with EDBCP tool 
[77-79]. Phylogenetic analyses were performed with CLC 
Main Workbench 6.6.2 (CLC bio, Aarhus Denmark) [80] 
and Mega5. The pylogentetic tree was generated with the 
"Create Tree" function of CLC Main Workbench 
using the UPGMA algorithm with a bootstrap analysis 
of 100 replicates, based on an alignment of the HRP 
amino acid sequences using the "Create Alignment" 
function with the following settings: Gap open cost: 10.0, 
Gap extension cost: 1.0, End gap cost: as any other, 
Alignment: Very accurate (slow). Signal sequences were 
predicted using SignalP 3.0 [81,82]. 

Gene synthesis and heterologous expression in Pichia 
pastoris 

The codon usages of 14 isoenzymes were optimized for 
the expression in P. pastoris using a novel algorithm 
(DNA2.0, Menlo Park, CA, USA, Mellitzer et al. 
manuscript in preparation). Further twelve isoen- 
zymes including allelic variants were optimized using 
GeneDesigner 1.1.4.1 (DNA2.0) in accordance to the 
P. pastoris codon usage described by Abad et al. [83]. 
Signal sequence variants were generated by PCR amplifi- 
cation and all HRP genes were cloned into the shuttle vec- 
tor pPpT4_alpha_S of a newly generated open source 
expression platform [64]. The vector pPpT4_alpha_S is a 
basic low-copy (1-5 copies/genome), zeocin™ resistance 
based expression vector for efficient secretory expression 



of heterologous proteins. Sanger sequencing of the 
plasmids verified successful cloning into the right frame. 
The linearized expression cassettes were transformed into 
P. pastoris wild-type CBS7435 based mut s strain using 
standard protocols [84], and selected on zeocin'"-containing 
plates. From each gene, 88 clones were picked to 96-well 
deep-well plates for cultivation and high-throughput 
screening of peroxidase activity. Two of the well expressing 
clones of each isoenzyme were streaked out to single 
colonies. Four single colonies of each clone were used for 
re-screening to estimate the reproducibility of the results. 
All media compositions and cultivation protocols used 
in this study were as previously described by [85]. 
Minimal media BMD1% (buffered minimal media with 
1% dextrose) was supplemented with 5 mM ferrous 
sulfate heptahydrate (Sigma-Aldrich Handels Gmbh, 
Vienna, Austria) to ensure sufficient iron supply for 
heme biosynthesis. 

Peroxidase assays 

ABTS (2,2'-azino-bis(3-ethylbenzthiazoline-6-sulfonic acid), 
TMB (3,3',5'5-tetramethyl benzidine), pyrogallol (1,2,3- 
trihydroxybenzene) and guaiacol (2-methoxyphenol) assays 
were used to detect peroxidase activity essentially as 
described in [3,86,87]. ABTS assays were performed 
in 50 mM sodium acetate buffer pH 4.5 with 1 mM ABTS 
and 0.0026% (v/v) H 2 0 2 . For the TMB stock solution, 
TMB was dissolved in DMSO to a concentration of 
4.16 mM. For the assay solution, TMB stock solution and 
30% (v/v) H 2 0 2 were diluted with 20 mM citrate buffer 
pH 5.5 to final concentrations of 0.416 mM and 0.006% 
(v/v), respectively. Guaiacol assays were performed in 
10 mM sodium phosphate buffer pH 7.0 with 5 mM 
guaiacol and 0.0009% (v/v) H 2 0 2 . For the pyrogallol assay 
solution, pyrogallol (Sigma-Aldrich Handels Gmbh, Vienna, 
Austria) was dissolved in 10 mM potassium phosphate 
buffer pH6.0 containing 0.027% (v/v) H 2 0 2 to a concentra- 
tion of 45 mM. For all assays, 15 ul cultivation supernatant 
was mixed with 140 ul of the assay solution in a flat-bottom 
96-well microtiterplate (Greiner Bio-One GmbH, 
Frickenhausen, Germany). The reaction kinetics were 
followed with Spectramax Plus 384 spectrophotometer 
and SoftMax" Pro software (Molecular Devices, LLC) 
for 3-5 min at wavelengths 405 nm (ABTS), 650 nm 
(TMB), 470 nm (guaiacol) and 420 nm (pyrogallol). 
Enzyme activity was calculated using only time points 
fitting to linear increase of the absorbance (AmAU min 1 ). 

Availability of supporting data 

A. rusticana transcriptome sequencing data is available via 
EMBL-EBI's European Nucleotide Archive (ENA) under 
the study accession number PRJEB5793 (http://www.ebi. 
ac.uk/ena/data/view/PRJEB5793). Nucleotide sequences of 
the novel identified HRPs have been deposited into 
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ENA as well: accession numbers HE963800-HE963825 
(http://www.ebi.ac.uk/ena/data/view/HE963800-HE963825). 
All other supporting data are included as additional files 
with this manuscript. 

Additional files 
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